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Abstract 

This report introduces general ideas and some basic methods of the Bayesian probabihty 
theory apphed to physics measurements. Our aim is to make the reader famiUar, through ex- 
amples rather than rigorous formalism, with concepts such as: model comparison (including 
the automatic Ockham's i?azor filter provided by the Bayesian approach); parametric infer- 
ence; quantification of the uncertainty about the value of physical quantities, also taking 
into account systematic effects; role of marginalization; posterior characterization; predic- 
tive distributions; hierarchical modelling and hyperparameters; Gaussian approximation of 
the posterior and recovery of conventional methods, especially maximum likelihood and chi- 
square fits under well defined conditions; conjugate priors, transformation invariance and 
maximum entropy motivated priors; Monte Carlo estimates of expectation, including a short 
introduction to Markov Chain Monte Carlo methods. " 

1 Introduction 

The last decades of 20**^ century have seen an intense expansion in the use of Bayesian methods in 
all fields of human activity that generally deal with uncertainty, including engineering, computer 
science, economics, medicine and even forensics (Kadane and Schum 1996). Bayesian networks 
(Pearl 1988, Cowell et al 1999) are used to diagrammatically represent uncertainty in expert 
systems or to construct artificial intelligence systems. Even venerable metrological associations, 
such as the International Organization for Standardization (ISO 1993), the Deutsches Institut 
fiir Normung (DIN 1996, 1999), and the USA National Institute of Standards and Technology 
(Taylor and Kuyatt 1994), have come to realize that Bayesian ideas are essential to provide 
general methods for quantifying uncertainty in measurements. A short account of the Bayesian 
upsurge can be found in a Science article (Malakoff 1999). A search on the web for the keywords 
'Bayesian,' 'Bayesian network,' or 'belief network' gives one a dramatic impression of this 'rev- 
olution,' not only in terms of improved methods, but more importantly in terms of reasoning. 
An overview of recent developments in Bayesian statistics, may be found in the proceedings of 
the Valencia Conference series. The last published volume was (Bernardo et al 1999), and the 
most recent conference was held in June 2002. Another series of workshops, under the title 
of Maximum Entropy and Bayesian Methods, has focused more on applications in the physical 
sciences. 

It is surprising that many physicists have been slow to adopt these 'new' ideas. There have 
been notable exceptions, of course, many of whom have contributed to the abovementioned 
Maximum Entropy workshops. One reason to be surprised is because numerous great physicists 
and mathematicians have played important roles in developing probability theory. These 'new' 
ideas actually originated long ago with Bernoulli, Laplace, and Gauss, just to mention a few 
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who contributed significantly to the development of physics, as well as to Bayesian thinking. 
So, while modern statisticians and mathematicians are developing powerful methods to apply 
to Bayesian analysis, most physicists, in their use and reasoning in statistics still rely on 20*^ 
century 'frequentist prescriptions' (D'Agostini 1999a, 2000). 

We hope that this report will help fill this gap by reviewing the advantages of using the 
Bayesian approach to address physics problems. We will emphasize more the intuitive and 
practical aspects than the theoretical ones. We will not try to cover all possible applications of 
Bayesian analysis in physics, but mainly concentrate on some basic applications that illustrate 
clearly the power of the method and how naturally it meshes with physicists' approach to their 
science. 

The vocabulary, expressions, and examples have been chosen with the intent to correspond, 
as closely as possible, to the education that physicists receive in statistics, instead of a more 
rigorous approach that formal Bayesian statisticians might prefer. For example, we avoid many 
important theoretical concepts, like exchangeability, and do not attempt to prove the basic rules 
of probability. When we talk about 'random variables,' we will in fact mean 'uncertain variables,' 
and instead of referring to the frequentist concept of 'randomness' a la von Mises (1957). This 
distinction will be clarified later. 

In the past, presentations on Bayesian probability theory often start with criticisms of 'con- 
ventional,' that is, frequentist ideas, methods, and results. We shall keep criticisms and detailed 
comparisons of the results of different methods to a minimum. Readers interested in a critical 
review of conventional frequentist statistics will find a large literature, because most introduc- 
tory books or reports on Bayesian analysis contain enough material on this matter. See (Gelman 
(it al 1995, Sivia 1997, D'Agostini 1999c, Jaynes 1998, Loredo 1990) and the references therein. 
Eloquent 'defenses of the Bayesian choice' can be found in (Howson and Urbach 1993, Robert 
2001). 

Some readers may wish to have references to unbiased comparisons of frequentist to Bayesian 
ideas and methods. To our knowledge, no such reports exist. Those who claim to be impartial 
are often frequentists who take some Bayesian results as if they were frequentist 'prescriptions,' 
not caring whether all underlying hypotheses apply. For two prominent papers of this kind, 
see the articles by Efron (1986a) [with follow up discussions by Lindley (1989), Zellner (1986), 
and Efron (1986b)] and Cousins (1995). A recent, pragmatic comparisons of frequentist and 
Bayesian confidence limits can be found in (Zech 2002). 

Despite its lack of wide-spread use in physics, and its complete absence in physics courses 
(D'Agostini 1999a), Bayesian data analysis is increasingly being employed in many areas of 
physics, for example, in astronomy (Gregory and Loredo 1992, 1996, Gregory 1999, Babu and 
Feigelson 1992, 1997, Bontekoe et al 1994), in geophysics (Glimm and Sharp 1999), in high- 
energy physics (D'Agostini and Degrassi 1999, Ciuchini et al 2001), in image reconstruction 
(Hanson 1993), in microscopy (Higdon and Yamamoto 2001), in quantum Monte Carlo (Guber- 
natis et al 1991), and in spectroscopy (Skilling 1992, Fischer et al 1997, 1998, 2000), just to 
mention a few articles written in the last decade. Other examples will be cited throughout the 
paper. 

2 Uncertainty and probability 

In the practice of science, we constantly find ourselves in a state of uncertainty. Uncertainty 
about the data that an experiment shall yield. Uncertainty about the true value of a physical 
quantity, even after an experiment has been done. Uncertainty about model parameters, cali- 
bration constants, and other quantities that might influence the outcome of the experiment, and 
hence influence our conclusions about the quantities of interest, or the models that might have 
produced the observed results. 
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In general, we know through experience that not all the events that could happen, or all 
conceivable hypotheses, are equally likely. Let us consider the outcome of you measuring the 
temperature at the location where you are presently reading this paper, assuming you use a 
digital thermometer with one degree resolution (or you round the reading at the degree if you 
have a more precise instrument). There are some values of the thermometer display you are 
more confident to read, others you expect less, and extremes you do not believe at all (some 
of them are simply excluded by the thermometer you are going to use). Given two events Ei 
and E2, for example Ei : "T = 22° C" and E2 ■ "T = 33° C", you might consider E2 much more 
probable than Ei, just meaning that you believe E2 to happen more than Ei. We could use 
different expressions to mean exactly the same thing: you consider E2 more likely; you are more 
confident in E2; having to choose between Ei and E2 to win a price, you would promptly choose 
E2; having to classify with a number, that we shall denote with P, your degree of confidence on 
the two outcomes, you would write P{E2) > P{Ei); and many others. 

On the other hand, we would rather state the opposite, i.e. P{Ei) > ^(£"2), with the same 
meaning of symbols and referring exactly to the same events: what you are going to read at your 
place with your thermometer. The reason is simply because we do not share the same status of 
information. We do not know who you are and where you are in this very moment. You and 
we are uncertain about the same event, but in a different way. Values that might appear very 
probable to you now, appear quite improbable, though not impossible, to us. 

In this example we have introduced two crucial aspects of the Bayesian approach: 

1. As it is used in everyday language, the term probability has the intuitive meaning of "the 
degree of belief that an event will occur. " 

2. Probability depends on our state of knowledge, which is usually different for different 
people. In other words, probability is unavoidably subjective. 

At this point, you might find all of this quite natural, and wonder why these intuitive concepts 
go by the esoteric name 'Bayesian.' We agree! The fact is that the main thrust of statistics theory 
and practice during the 20*^ century has been based on a different concept of probability, in 
which it is defined as the limit of the long-term relative frequency of the outcome of these events. 
It revolves around the theoretical notion of infinite ensembles of 'identical experiments.' Without 
entering an unavoidably long critical discussion of the frequentist approach, we simply want to 
point out that in such a framework, there is no way to introduce the probability of hypotheses. 
All practical methods to overcome this deficiency yield misleading, and even absurd, conclusions. 
See (D'Agostini 1999c) for several examples and also for a justification of why frequentistic test 
'often work'. 

Instead, if we recover the intuitive concept of probability, we are able to talk in a natural 
way about the probability of any kind of event, or, extending the concept, of any proposition. In 
particular, the probability evaluation based on the relative frequency of similar events occurred 
in the past is easily recovered in the Bayesian theory, under precise condition of validity (see 
Sect. 15.3")) . Moreover, a simple theorem from probability theory, Bayes' theorem, which we shall 
see in the next section, allows us to update probabilities on the basis of new information. This 
inferential use of Bayes' theorem is only possible if probability is understood in terms of degree 
of belief. Therefore, the terms 'Bayesian' and 'based on subjective probability' are practically 
synonyms, and usually mean 'in contrast to the frequentist, or conventional, statistics.' The 
terms 'Bayesian' and 'subjective' should be considered transitional. In fact, there is already the 
tendency among many Bayesians to simply refer to 'probabilistic methods,' and so on (Jeffreys 
1961, de Finetti 1974, Jaynes 1998 and Cowell et al 1999). 

As mentioned above, Bayes' theorem plays a fundamental role in the probability theory. 
This means that subjective probabilities of logically connected events are related to each other 
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by mathematical rules. This important result can be summed up by saying, in practical terms, 
that 'degrees of belief follow the same grammar as abstract axiomatic probabilities.' Hence, all 
formal properties and theorems from probability theory follow. 

Within the Bayesian school, there is no single way to derive the basic rules of probability (note 
that they are not simply taken as axioms in this approach), de Finetti's principle of coherence 
(de Finetti 1974) is considered the best guidance by many leading Bayesians (Bernardo and 
Smith 1994, O'Hagan 1994, Lad 1996 and Colctti and Scozzafava 2002). Sec (D'Agostini 1999c) 
for an informal introduction to the concept of coherence, which in simple words can be outlined 
as follows. A person who evaluates probability values should be ready to accepts bets in either 
direction, with odd ratios calculated from those values of probability. For example, an analyst 
that declares to be confident 50% on E should be aware that somebody could ask him to make 
a 1:1 bet on £^ or on £'. If he/she feels uneasy, it means that he/she does not consider the two 
events equally likely and the 50% was 'incoherent.' 

Others, in particular practitioners close to the Jaynes' Maximum Entropy school (Jaynes 
1957a, 1957b) feel more at ease with Cox's logical consistency reasoning, requiring some con- 
sistency properties ('desiderata') between values of probability related to logically connected 
propositions. (Cox 1946). See also (Jaynes 1998, Sivia 1997, and Frohner 2000, and especially 
Tribus 1969), for accurate derivations and a clear account of the meaning and role of information 
entropy in data analysis. An approach similar to Cox's is followed by Jeffreys (1961), another 
leading figure who has contributed a new vitality to the methods based on this 'new' point 
of view on probability. Note that Cox and Jeffreys were physicists. Remarkably, Schrodinger 
(1947a, 1947b) also arrived at similar conclusions, though his definition of event is closer to the 
dc Finetti's one. [Some short quotations from (Schrodinger 1947a) are in order. Definition of 
probability: "... a quantitative m,easure of the strength of our conjecture or anticipation, founded 
on the said knowledge, that the event comes true". Subjective nature of probability: ^^Since the 
knowledge may be different with different persons or with the same person at different times, 
they may anticipate the same event with more or less confidence, and thus different numerical 
probabilities may be attached to the same event. " Conditional probability: "Thus whenever we 
speak loosely of 'the probability of an event, ' it is always to be understood: probability with regard 
to a certain given state of knowledge. "] 

3 Rules of probability 

We begin by stating some basic rules and general properties that form the 'grammar' of the 
probabilistic language, which is used in Bayesian analysis. In this section, we review the rules 
of probability, starting with the rules for simple propositions. We will not provide rigorous 
derivations and will not address the foundational or philosophical aspects of probability theory. 
Moreover, following an 'eclectic' approach which is common among Bayesian practitioners, we 
talk indifferently about probability of events, probability of hypotheses or probability of propo- 
sitions. Indeed, the last expression will be often favoured, understanding that it does include 
the others. 

3.1 Probability of simple propositions 

Let us start by recalling the basic rules of probability for propositions or hypotheses. Let A and 
B be propositions, which can take on only two values, for example, true or false. The notation 
P{A) stands for the probability that A is true. The elementary rules of probability for simple 
propositions are 

< P{A) < 1; (1) 
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Pm = 1; (2) 

P{AuB) = P{A)+P{B)-P{AnB). (3) 
P{A nB) = P{A I B) P{B) = P{B I A) P{A) , (4) 

where Q means tautology (a proposition that is certainly true). The construct ACiB is true only 
when both A and B are true (logical AND), while AD B is true when at least one of the two 
propositions is true (logical OR). AnB is also written simply as 'A, B^ or AB, and is also called 
a logical product, while ^4 U i? is also called a logical sum. P{A, B) is called the joint probability 
of A and B. P{A \ B) is the probability of A under that condition that B is true. We often read 
it simply as "the probability of A, given B." . 

Equation (jlj shows that the joint probability of two events can be decomposed into con- 
ditional probabilities in different two ways. Either of these ways is called the product rule. 
If the status of B does not change the probability of A, and the other way around, then A 
and B are said to be independent, probabilistically independent to be precise. In that case, 
P{A I B) = P{A), and P{B \ A) = P{B), which, when inserted in Eq. (jlj), yields 

P{A n B) = P{A) P{B) <^=^ probabilistic independence . (5) 

Equations logically lead to other rules which form the body of probability theory. For 

example, indicating the negation (or opposite) of A with A, clearly AuAis a tautology (AuA = 
0,), and A n A is a contradiction {ADA = 0). The symbol stands for contradiction (a 
proposition that is certainly false). Hence, we obtain from Eqs. © and (jH)) 

PiA) + P(A) = 1, (6) 

which says that proposition A is either true or not true. 



3.2 Probability of complete classes 

These formulae become more interesting when we consider a set of propositions Hj that all 
together form a tautology (i.e., they are exhaustive) and are mutually exclusive. Formally 

UiHj = n (7) 

HjnHk = if i / A; . (8) 

When these conditions apply, the set {Hj} is said to form a complete class. The symbol H has 
been chosen because we shall soon interpret {Hj} as a set of hypotheses. 
The first (trivial) property of a complete class is normalization, that is 

Em) = 1. (9) 

which is just an extension of Eq. Q to a complete class containing more than just a single 
proposition and its negation. 

For the complete class H, the generalizations of Eqs. © and the use of Eq. (jlJ) yield: 

P{A) = Y.P(^^H,) (10) 

j 

P{A) = Y.P{A\H,)P{H,). (11) 

j 

Equation (|l()j) is the basis of what is called marginalization, which will become particularly 
important when dealing with uncertain variables: the probability of A is obtained by the sum- 
mation over all possible constituents contained in A. Hereafter, we avoid explicitly writing the 
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limits of the summations, meaning that they extend over ah elements of the class. The con- 
stituents are ^A,Hj,^ which, based on the complete class of hypotheses {H}, themselves form 
a complete class, which can be easily proved. Equation shows that the probability of any 
proposition is given by a weighted average of all conditional probabilities, subject to hypotheses 
Hj forming a complete class, with the weight being the probability of the hypothesis. 

In general, there are many ways to choose complete classes (like 'bases' in geometrical spaces). 
Let us denote the elements of a second complete class by Ei. The constituents are then formed 
by the elements {Ei,Hj) of the Cartesian product {E} x {H}. Equations (jlUj) and then 
become the more general statements 

Pm = Y.PiE^.H,) (12) 

P{Ei) = Y.nE^\H,)P{H,) (13) 
j 

and, symmetrically, 

P{H,) = Y.nEr,H,) (14) 

i 

P{H,) = Y.PiHj\Ei)P{Ei). (15) 

i 

The reason we write these formulae both ways is to stress the symmetry of Bayesian reasoning 
with respect to classes {E} and {-fT}, though we shall soon associate them with observations (or 
events) and hypotheses, respectively. 



3.3 Probability rules for uncertain variables 

In analyzing the data from physics experiments, we need to deal with measurement that are 
discrete or continuous in nature. Our aim is to make inferences about the models that we believe 
appropriately describe the physical situation, and/or, within a given model, to determine the 
values of relevant physics quantities. Thus, we need the probability rules that apply to uncertain 
variables, whether they are discrete or continuous. The rules for complete classes described in 
the preceding section clearly apply directly to discrete variables. With only slight changes, the 
same rules also apply to continuous variables because they may be thought of as a limit of 
discrete variables, as interval between possible discrete values goes to zero. 

For a discrete variable x, the expression p{x), which is called a probability function, has 
the interpretation in terms of the probability of the proposition P{A), where A is true when 
the value of the variable is equal to x. In the case of continuous variables, we use the same 
notation, but with the meaning of a probability density function (pdf). So p{x) dx, in terms of 
a proposition, is the probability P{A), where A is true when the value of the variable lies in 
the range of x to a; + dx. In general, the meaning is clear from the context; otherwise it should 
be stated. Probabilities involving more than one variable, like p{x, y), have the meaning of the 
probability of a logical product; they are usually called joint probabilities. 

Tabled summarizes useful formulae for discrete and continuous variables. The interpretation 
and use of these relations in Bayesian inference will be illustrated in the following sections. 



4 Bayesian inference for simple problems 

We introduce the basic concepts of Bayesian inference by considering some simple problems. 
The aim is to illustrate some of the notions that form the foundation of Bayesian reasoning. 
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Table 1: Some deEnitions and properties of prohahility functions for values of a discrete variable 
Xi and probability density functions for continuous variables x. All summations and integrals 
are understood to extend over the full range of possibilities of the variable. Note that the 
expectation of the variable is also called expected value (sometimes expectation value), average 
and mean. The square root of the variance is the standard deviation a. 

discrete variables continuous variables 

probability P{X = Xi) = p{xi) ^P[x<x<x+dx] = P(^) 

normalization^ J2iP{xi) = 1 Jp{x) dx = l 

expectation of f{X) E[/(X)] = f{xi) p{xi) E[/(X)] = //(x) p{x) dx 

expected value E(X) = Xjp(xj) ¥j{X) = J x p{x) dx 

moment of order r M.r{X) = J2iXiP{xi) Mr{X) = Jx"^ p{x)dx 

variance = - E{X)]'^ p{xi) = J[x - E{X)]'^ p{x) dx 

product rule p{xi, yj) = p{xi \ yj)p{yj) p{x, y) = p{x \ y) p{y) 

independence p{xi,yj) = p{xi) p{yj) p{x,y) = p{x) p{y) 

marginalization J2j P{xi,yj) = pixi) Jp{x,y) dy = p{x) 

decomposition p{xi) = Y.j p{xi \yj) p{yj) p{x) = Jp{x \ y) p{y) dy 

Bayes' theorem pix^ \ yi) = ^(^; I pl^ I y) = p{y\x)p{x) 

y y\ 3\ yi) ^(y. | Xj)p{xj) ' ^' J p{y \ x) p{x) dx 

likelihood C{xj ; yi) = p{yi \ xj) C{x ; y) = p{y | x) 

tA function p{x) such that '^^p{xi) = oo, or Jp{x)dx = oo, is called improper. Improper 
functions are often used to describe relative beliefs about the possible values of a variable. 
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4.1 Background information 



As we think about drawing conclusions about the physical world, we come to realize that ev- 
erything we do is based on what we know about the world. Conclusions about hypotheses will 
be based on our general background knowledge. To emphasize the dependence of probability on 
the state of background information, which we designate as I, we will make it explicit by writing 
P{E 1 1), rather than simply P{E). (Note that, in general, P{A \ Ii) / P{A \ I2), if h and I2 are 
different states of information.) For example, Eq. Q should be more precisely written as 

P{A nB\I)= P{A \BnI) P{B I /) = P{B \Ar\I) P{A I /) , (16) 

or alternatively as 

P{A, B\I) = P{A I B, I) P{B I /) = P{B I A, I) P{A \ I) . (17) 

We have explicitly included / as part of the conditional to remember that any probability relation 
is valid only under the same state of background information. 



4.2 Bayes' theorem 

Formally, Bayes' theorem follows from the symmetry of P{A, B) expressed by Eq. ()17|). In terms 
of Ei and Hj belonging to two different complete classes, Eq. (fTTj) yields 

P{H,\E,J) ^ P{E,\H,J) 
P{Hj\I) P{E,\I) ^ 

This equation says that the new condition Ei alters our belief in Hj by the same updating factor 
by which the condition Hj alters our belief about Ei . Rearrangement yields Bayes ' theorem 

P(H I F n - PiE^\H„I)PiH,\I) 

P{H,\E,,I)- . (19) 

We have obtained a logical rule to update our beliefs on the basis of new conditions. Note that, 
though Bayes' theorem is a direct consequence of the basic rules of axiomatic probability theory, 
its updating power can only be fully exploited if we can treat on the same basis expressions 
concerning hypotheses and observations, causes and effects, models and data. 

In most practical cases, the evaluation of P{Ei \ I) can be quite difficult, while determining 
the conditional probability P{Ei \ Hj, I) might be easier. For example, think of Ei as the prob- 
ability of observing a particular event topology in a particle physics experiment, compared with 
the probability of the same thing given a value of the hypothesized particle mass (Hj), a given 
detector, background conditions, etc. Therefore, it is convenient to rewrite P{Ei \ I) in Eq. H19|l 
in terms of the quantities in the numerator, using Eq. (|13jl . to obtain 



P{Ei\Hj,I)P{H,\I) 
Y.jP{Ei\Hj,I)P{H,\I) 



which is the better-known form of Bayes' theorem. Written this way, it becomes evident that 
the denominator of the r.h.s. of Eq. 1)201) is just a normalization factor and we can focus on just 
the numerator: 

P{Hj\E,,I) oc P{Ei\Hj,I)P{Hj\I). (21) 

In words 

posterior oc likelihood x prior , (22) 
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where the posterior (or final state) stands for the probabihty of Hj, based on the new observation 
Ei, relative to the prior (or initial) probabihty. (Prior probabihties are often indicated with Pq.) 
The conditional probability P{Ei \ Hj) is called the likelihood. It is literally the probability of the 
observation Ei given the specific hypothesis Hj. The term likelihood can lead to some confusion, 
because it is often misunderstood to mean "the likelihood that Ei comes from Hj." However, 
this name implies to consider P{Ei \Hj) a mathematical function of Hj for a fixed Ei and in 
that framework it is usually written as C{Hj;Ei) to emphasize the functionality. We caution 
the reader that one sometimes even finds the notation C{Ei \ Hj) to indicate exactly P{Ei \ Hj). 

4.3 Inference for simple hypotheses 

Making use of formulae (|2flj) or (|21j) . we can easily solve many classical problems involving 
inference when many hypotheses can produce the same single effect. Consider the case of 
interpreting the results of a test for the HIV virus applied to a randomly chosen European. 
Clinical tests are very seldom perfect. Suppose that the test accurately detects infection, but 
has a false-positive rate of 0.2%: 



If the test is positive, can we conclude that the particular person is infected with a probability 
of 99.8% because the test has only a 0.2% chance of mistake? Certainly not! This kind of 
mistake is often made by those who are not used to Bayesian reasoning, including scientists who 
make inferences in their own field of expertise. The correct answer depends on what we else 
know about the person tested, that is, the background information. Thus, we have to consider 
the incidence of the HIV virus in Europe, and possibly, information about the lifestyle of the 
individual. For details, see (D'Agostini 1999c). 

To better understand the updating mechanism, let us take the ratio of Eq. (|2()|) for two 
hypotheses Hj and if^ 



where the sums in the denominators of Eq. H20|) cancel. It is convenient to interpret the ratio 
of probabilities, given the same condition, as betting odds. This is best done formally in the de 
Finetti approach, but the basic idea is what everyone is used to: the amount of money that 
one is willing to bet on an event is proportional to the degree to which one expects that event 
will happen. Equation H23|) tells us that, when new information is available, the initial odds are 
updated by the ratio of the likelihoods P{Ei \ Hj, I)/ P{Ei \ H^,!), which is known as the Bayes 
factor. 

In the case of the HIV test, the initial odds for an arbitrarily chosen European to be in- 
fected P{Hj I /) /P{Hk I /) are so small that we need a very high Bayes' factor to be reasonably 
certain that, when the test is positive, the person is really infected. With the numbers used 
in this example, the Bayes factor is 500 = 1/0.002. For example, if we take for the prior 
Po (Infected) /Pq (Infected) = 1/1000, the Bayes' factor changes these odds to 500/1000 = 1/2, 
or equivalently, the probability that the person is infected would be 1/3, quite different from 
the 99.8% answer usually prompted by those who have a standard statistical education. This 
example can be translated straightforwardly to physical problems, like particle identification in 
the analysis of a Cherenkov detector data, as done, e.g. in (D'Agostini 1999c). 



P(Positive I Infected) 



1 , and ^(Positive | Infected) = 0.2% . 



P{Hj\E,J) 
P{Hk\Ei,I) 



P{Ei\Hj,I) P{Hj\I) 
P{Ei\HkJ) P{Hk\I) 



(23) 
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5 Inferring numerical values of physics quantities — General 
ideas and basic examples 

In physics we are concerned about models ('theories') and the numerical values of physical 
quantities related to them. Models and the value of quantities are, generally speaking, the 
hypothesis we want to infer, given the observations. In the previous section we have learned 
how to deal with simple hypotheses, 'simple' in the sense that they do not depend on internal 
parameters. 

On the other hand, in many applications we have strong beliefs about what model to use 
to interpret the measurements. Thus, we focus our attention on the model parameters, which 
we consider as uncertain variables that we want to infer. The method which deals with these 
applications is usually referred as parametric inference, and it will be shown with examples in 
this section. In our models, the value of the relevant physical quantities are usually described 
in terms of a continuous uncertain variable. Bayes' theorem, properly extended to uncertain 
quantities (see Tab^, plays a central role in this inference process. 

A more complicate case is when we are also uncertain about the model (and each possible 
model has its own set of parameter, usually associated with different physics quantities). We 
shall analyse this problem in Sect. [71 



5.1 Bayesian inference on uncertain variables and posterior characterization 

We start here with a few one-dimensional problems involving simple models that often occur in 
data analysis. These examples will be used to illustrate some of the most important Bayesian 
concepts. Let us first introduce briefly the structure of the Bayes' theorem in the form convenient 
to our purpose, as a straightforward extension of what was seen in Sect. 14.21 

= Jp{d\9J)p{9\I)d9- 

9 is the generic name of the parameter (used hereafter, unless the models have traditional symbols 
for their parameters) and d is the data point. p{9 \ I) is the prior, p(9 \ d, I) the posterior and 
p{d I 9, 1) the likelihood. Also in this case the likelihood is is often written as £,{9; d) = p{d | 0, /), 
and the same words of caution expressed in Sect. apply here too. Note, moreover, that, while 
p{d I 9, 1) is a properly normalized pdf, C{9; d) has not a pdf meaning in the variable 9. Hence, 
the integral of C{9; d) over 9 is only accidentally equal to unity. The denominator in the r.h.s. 
of Eq. ()24() is called the evidence and, while in the parametric inference discussed here is just a 
trivial normalization factor, its value becomes important for model comparison (see Sect. Ej). 

Posterior probability distributions provide the full description of our state of knowledge about 
the value of the quantity. In fact, they allow to calculate all probability intervals of interest. Such 
intervals are also called credible intervals (at a specified level of probability, for example 95%) or 
confidence intervals (at a specified level of 'confidence', i.e. of probability). However, the latter 
expression could be confused with frequentistic 'confidence intervals', that are not probabilistic 
statements about uncertain variables (D'Agostini 1999c). 

It is often desirable to characterize the distribution in terms of a few numbers. For example, 
mean value (arithmetic average) of the posterior, or its most probable value (the mode) of 
the posterior, also known as the maximum a posteriori (MAP) estimate. The spread of the 
distribution is often described in terms of its standard deviation (square root of the variance). 
It is useful to associate the terms mean value and standard deviation with the more inferential 
terms expected value, or simply expectation (value), indicated by E(), and standard uncertainty 
(ISO 1993), indicated by cr(), where the argument is the uncertain variable of interest. This 
will be our standard way of reporting the result of inference in a quantitative way, though, we 
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emphasize that the full answer is given by the posterior distribution, and reporting only these 
summaries in case of the complex distributions (e.g. multimodal and/or asymmetrical pdf's) 
can be misleading, because people tend to think of a Gaussian model if no further information 
is provided. 



5.2 Gaussian model 

Let us start with a classical example in which the response signal d from a detector is described 
by a Gaussian error function around the true value ^ with a standard deviation a, which is 
assumed to be exactly known. This model is the best-known among physicists and, indeed, the 
Gaussian pdf is also known as normal because it is often assumed that errors are 'normally' 
distributed according to this function. Applying Bayes' theorem for continuous variables (see 
Tab.^, from the likelihood 



p{d\fi,I) 



2ira 



exp 



2(72 



(25) 



we get for /i 



p(/x|d, /) 



2ira 



exp 



(rf-M)' 

2(72 



p(/i I /) 



27r(7 



exp 



2(72 



(26) 



pin I /) d/x 



Considering all values of fj, equally likely over a very large interval, we can model the prior 
p{p I /) with a constant, which simplifies in Eq. (|26|1 . yielding 



p{fi\d, I) 



1 



2ira 



exp 



2(72 



(27) 



Expectation and standard deviation of the posterior distribution are E(/i) = d and (7(^) = a, 
respectively. This particular result corresponds to what is often done intuitively in practice. 
But one has to pay attention to the assumed conditions under which the result is logically valid: 
Gaussian likelihood and uniform prior. Moreover, we can speak about the probability of true 
values only in the subjective sense. It is recognized that physicists, and scientists in general, are 
highly confused about this point (D'Agostini 1999a). 

A noteworthy case of a prior for which the naive inversion gives paradoxical results is when 
the value of a quantity is constrained to be in the 'physical region,' for example /x > 0, while 
d falls outside it (or it is at its edge). The simplest prior that cures the problem is a step 
function 9{p,), and the result is equivalent to simply renormalizing the pdf in the physical region 
(this result corresponds to a 'prescription' sometimes used by practitioners with a frequentist 
background when they encounter this kind of problem). 

Another interesting case is when the prior knowledge can be modeled with a Gaussian func- 
tion, for example, describing our knowledge from a previous inference 



p{fi\fj,o,cro,I) 



Inserting Eq. into Eq. I^^ . we get 



p{lJ,\d, p.Q,aQ,I) 



27r (7o 



exp 



27r (7i 



exp 



2al 



2(72 



(28) 



(29) 
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where 



+ l/al 



/^i=E(^) = ;l:7;j (30) 



^2 ^2 ^2 



«+^— — 2 /"o = -o /^o (31) 



-1 



Var(^) = + (32) 



We can then see that the case 1 1) = constant corresponds to the hmit of a Gaussian prior with 
very large ctq and finite //q- The formula for the expected value combining previous knowledge 
and present experimental information has been written in several ways in Eq. H31|) . 

Another enlighting way of writing Eq. ()3Up is considering /xq and fii the estimates of ^ at 
times to and ti, respectively before and after the observation d happened at time ti. Indicating 
the estimates at different times by /i(t), we can rewrite Eq. (|3U() as 

Mil) = 2,/;^l, , Mii)+ 2,,flH,, , Mto) 

= M*o) + ^tt^^Httt K^i) - AM (33) 

= /i(io)+i^(ti)Kii)-A(io)] (34) 
^^m(*i) = c7lito)-K{ti)al{to), (35) 

where 

c^d(*i) +^M*o) 

Indeed, we have given Eq. (|3U|) the structure of a Kalman filter (Kalman 1960). The new obser- 
vation 'corrects' the estimate by a quantity given by the innovation (or residual) [d{ti) — /i(to)] 
times the blending factor (or gain) K[ti). For an introduction about Kalman filter and its 
probabilistic origin, see (Maybeck 1979 and Welch and Bishop 2002). 

As Eqs. H31|) - H35() show, a new experimental information reduces the uncertainty. But this 
is true as long the previous information and the observation are somewhat consistent. If we are, 
for several reasons, sceptical about the model which yields the combination rule H31 |) - (|32() . we 
need to remodel the problem and introduce possible systematic errors or underestimations of 
the quoted standard deviations, as done e.g. in (Press 1997, Dose and von der Linden 1999, 
D'Agostini 1999b, Frohner 2000). 



5.3 Binomial model 

In a large class of experiments, the observations consist of counts, that is, a number of things 
(events, occurrences, etc.). In many processes of physics interests the resulting number of counts 
is described probabilistically by a binomial or a Poisson model. For example, we want to draw 
an inference about the efficiency of a detector, a branching ratio in a particle decay or a rate 
from a measured number of counts in a given interval of time. 

The binomial distribution describes the probability of randomly obtaining n events ('suc- 
cesses') in independent trials, in each of which we assume the same probability 9 that the 
event will happen. The probability function is 

p(n|6',A^) = ( ^ ) r(l-e)^-", (37) 
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p(e|n,N) 




Figure 1: Posterior probability density function of the binomial parameter 9, having observed n 
successes in trials. 



where the leading factor is the well-known binomial coefficient, namely N\/nl{N — n)!. We wish 
to infer 6 from an observed number of counts n in trials. Incidentally, that was the "problem 
in the doctrine of chances" originally treated by Bayes (1763), reproduced e.g. in (Press 1992). 
Assuming a uniform prior for 6, by Bayes' theorem the posterior distribution for 9 is proportional 
to the likelihood, given by Eq. (|37jl : 

on (-1 a\N~n 

p{9\n,NJ) = / ^ ^ (38) 

^ ' ^ J^9^{l-9)^-"d9 



n! (AT - n)! 

Some examples of this distribution for various values of n and are shown in Fig. ^ Expec- 
tation, variance, and mode of this distribution are: 

E{9) = — ^ (40) 

^ (n + l){N-n + l) _ E{9) (l-E(g)) 
(A^ + 3)(A^ + 2)2 A^ + 3 

(42) 



n 
N 



where the mode has been indicated with 9xn- Equation H4U|) is known as the Laplace formula. For 
large values of A^ and <ti n <ti N the expectation of 9 tends to 9^^, and p{9) becomes approxi- 
mately Gaussian. This result is nothing but a reflection of the well-known asymptotic Gaussian 
behavior ofp{n \ 9,N). For large A^ the uncertainty about 9 goes like 1/^/N. Asymptotically, we 
are practically certain that 9 is equal to the relative frequency of that class of events observed 
in the past. This is how the frequency based evaluation of probability is promptly recovered in 
the Bayesian approach, under well defined assumptions. 
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Figure 2: The posterior distribution for the Poisson parameter A, when n counts are observed 
in an experiment. 

5.4 Poisson model 

The Poisson distribution gives the probabihty of observing n counts in a fixed time interval, 
when the expectation of the number of counts to be observed is A: 

A"e~^ 

P{n\X) = (43) 

The inverse problem is to infer A from n counts observed. (Note that what physically matters 
is the rate r = A/ AT, where AT is the observation time.) Applying Bayes' theorem and using 
a uniform prior p{X \ I) for A, we get 

A" e-^ 

p{X I n, /) = = . (44) 

°° A e n! 



dA 



n! 

As for the Gaussian model, the same mathematical expression holds for the likelihood, but with 
interchanged role of variable and parameter. Expectation and variance of A are both equal 
to n + 1, while the most probable value is Am = n. For large n, the extra '+1' (due to the 
asymmetry of the prior with respect to A = 0) can be ignored and we have E(A) = o"^(A) ~ n 
and, once again, the uncertainty about A follows a Gaussian model. The relative uncertainty on 
A decreases as Yj^/n. 

When the observed value of n is zero, Eq. H44|) yields p(A | n = 0) = e"'*', giving a maximum 
of belief at zero, but an exponential tail toward large values of A. Expected value and standard 
deviation of A are both equal to 1. The 95% probabilistic upper bound of A is at Ags^jy^ = 3, 
as it can be easily calculated solving the equation /o^'"*^^°^^p(A | n = 0) dA = 0.95. Note that also 
this result depends on the choice of prior, though Astone and D'Agostini (1999) have shown 
that the upper bound is insensitive to the exact form of the prior, if the prior models somehow 
what they call "positive attitude of rational scientists" (the prior has not to be in contradiction 
with what one could actually observe, given the detector sensitivity). In particular, they show 
that a uniform prior is a good practical choice to model this attitude. On the other hand. 
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talking about 'objective' probabilistic upper/lower limits makes no sense, as discussed in detail 
and with examples in the cited paper: one can at most speak about conventionally defined 
non-probabilistic sensitivity bounds, which separate the measurement region from that in which 
experimental sensitivity is lost (Astone and D'Agostini 1999, D'Agostini 2000, Astone et al 
2002). 



5.5 Inference from a data set and sequential use of Bayes formula 

In the elementary examples shown above, the inference has been done from a single data point 
d. If we have a set of observations (data), indicated by d, we just need to insert in the Bayes 
formula the likelihood p{d \ 9,1), where this expression indicates a multi-dimensional joint pdf. 

Note that we could think of inferring 9 on the basis of each newly observed datum dj. After 
the one observation: 

p{9\di,I) o^p{di\9,I)p{9\I) (45) 

and after the second: 

p{9\di,d2,I) oc p{d2\9,di,I)pi9\di,I) (46) 
oc p{d2\9,di,I)pidi\9,I)p{9\I). (47) 

We have written Eq. (|47() in a way that the dependence between observables can be accommo- 
dated. From the product rule in Tab. we can rewrite Eq. (|47)) as 

p{9\di,d2,I) « p{di,d2\9,I)p{9\I). (48) 

Comparing this equation with (|47|) we see that the sequential inference gives exactly the same 
result of a single inference that properly takes into account all available information. This is an 
important result of the Bayesian approach. 

The extension to many variables is straightforward, obtaining 

p{9\d,I) oc p{d\e,I)p{e\I). (49) 

Furthermore, when the di are independent, we get for the likelihood 

p{d\e,i) = X{p{di\e,i) (50) 

i 

C{9;d) = HCiO-d^), (51) 

i 

that is, the combined likelihood is given by the product of the individual likelihoods. 



5.6 Multidimensional case — Inferring jj and cr of a Gaussian 

So far we have only inferred one parameter of a model. The extension to many parameters is 
straightforward. Calling 6 the set of parameters and d the data, Bayes' theorem becomes 

Equation 1)52(1 gives the posterior for the full parameter vector 6. Marginalization (see Tab. ^ 
allows one to calculate the probability distribution for a single parameter, for example, p{9i \ d,I), 
by integrating over the remaining parameters. The marginal distribution p{9i \ d, I) is then the 
complete result of the Bayesian inference on the parameter 9i. Though the characterization 
of the marginal is done in the usual way described in Sect. 15.11 there is often the interest to 
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summarize some characters of the multi-dimensional posterior that are unavoidably lost in the 
marginalization (imagine marginalization as a kind of geometrical projection). Useful quantities 
are the covariances between parameters 9i and Oj, defined as 

Cov{ei,e,) = E[(^i-E[0i])(e, -E[^,])]. (53) 

As is well know, quantities which give a more intuitive idea of what is going on are the correlation 
coefficients, defined as p(6i,6j) = Cov{9i,9j)/a{9i)a{9j). Variances and covariances form the 
covariance matrix V(0), with Vu = Var(0j) and Vij = Cov{9i,9j). We recall also that convenient 
formulae to calculate variances and covariances are obtained from the expectation of the products 
9i9j, together with the expectations of the parameters: 

V^J = Ei9i9J)-E{9^)E{9j) (54) 

As a first example of a multidimensional distribution from a data set, we can think, again, at the 
inference of the parameter fj, of a Gaussian distribution, but in the case that also a is unknown 
and needs to be determined by the data. From Eqs. ([52|). (j5fl|) and (|25|l . with 9i = p and 92 = o 
and neglecting overall normalization, we obtain 

p(/i, cr I d, /) oc (T^" exp 

p{fj.\d,I) = J p{fi,a\d,I)da (56) 

pia\d,I) = J p{fi,a\d,I)dn. (57) 

The closed form of Eqs. (|56|) and lFfj) depends on the prior and, perhaps, for the most realistic 
choice of p{fi, cr\ I), such a compact solution does not exists. But this is not an essential issue, 
given the present computational power. (For example, the shape of p{n, a \ I) can be easily 
inspected by a modern graphical tool.) We want to stress here the conceptual simplicity of the 
Bayesian solution to the problem. [In the case the data set contains some more than a dozen 
of observations, a flat p{fJ.,cr \ I), with the constraint o" > 0, can be considered a good practical 
choice.] 



2^2 



p{fj.,a\I) 



(55) 



5.7 Predictive distributions 

A related problem is to 'infer' what an experiment will observe given our best knowledge of 
the underlying theory and its parameters. Infer is within quote marks because the term is 
usually used for model and parameters, rather than for observations. In this case people prefer 
to speak about prediction (or prevision). But we recall that in the Bayesian reasoning there is 
conceptual symmetry between the uncertain quantities which enter the problem. Probability 
density functions describing not yet observed event are referred to as predictive distributions. 
There is a conceptual difference with the likelihood, which also gives a probability of observation, 
but under different hypotheses, as the following example clarifies. 

Given // and a, and assuming a Gaussian model, our uncertainty about a 'future' d/ is 
described by the Gaussian pdf Eq. H25|) with d = df. But this holds only under that particular 
hypothesis for /i and a, while, in general, we are also uncertain about these values too. Applying 
the decomposition formula (Tab. we get: 

Pidf\I) = J p{df\n,a,I)p{n,a\I)d^ida (58) 

Again, the integral might be technically difficult, but the solution is conceptually simple. Note 
that, though the decomposition formula is a general result of probability theory, it can be applied 
to this problem only in the subjective approach. 
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An analytically easy, insightful case is that of experiments with well-known cr's. Given a 
past observation dp and a vague prior, p{fi \dp,I) is Gaussian around dp with variance 0"^ [note 
that, with respect to cr\ I) of Eq.((^. it has been made explicit that p{fj,) depend on dp]. 
p{df I /i) is Gaussian around /i with variance a?. We get finally 



p{df\dp,I) = j p{df\ fi,I)p{fi\dp,I)dfi (59) 

(df - dpf 



exp 



2(cT2 + a2)j 



(60) 



5.8 Hierarchical modelling and hyperparameters 

As we have seen in the previous section, it is often desirable to include in a probabilistic model 
one's uncertainty in various aspects of a pdf. This is a natural feature of the Bayesian methods, 
due to the uniform approach to deal with uncertainty and from which powerful analysis tools 
are derived. This kind of this modelling is called hierarchical because the characteristics of one 
pdf are controlled by another pdf. All uncertain parameters from which the pdf depends are 
called hyperparameter. An example of use of hyperparameter is described in Sect. in which 
the prior to infer ^ in a binomial model are shown to be controlled by the parameters of a Beta 
distribution. 

As an example of practical importance, think of the combination of experimental results 
in the presence of outliers, i.e. of data points which are somehow in mutual disagreement. In 
this case the combination rule given by Eqs. extended to many data points, pro- 

duces unacceptable conclusions. A way of solving the problem (Dose and von der Linden 1999, 
D'Agostini 1999b) is to model a scepticism about the quoted standard deviations of the exper- 
iments, introducing a pdf f{r), where r is a rescaling factor of the standard deviation. In this 
way the u's that enter the r.h.s. of Eqs. (|30|) - (|32j) are hyperparameters of the problem. An 
alternative approach, also based on hierarchical modelling, is shown in (Frohner 2000). For a 
more complete introduction to the subject see e.g. (Gelman et al 1995). 



5.9 From Bayesian inference to maximum-likelihood and minimum chi- 
square model fitting 

Let us continue with the case in which we know so little about appropriate values of the param- 
eters that a uniform distribution is a practical choice for the prior. Equation (|52j) becomes 

p{e\d,i) oc p{d\e,i)pQ{e,i) (xp{d\e,i) = c{e;d) , (ei) 

where, we recall, the likelihood C{0; d) is seen as a mathematical function of 0, with parameters 
d. 

The set of that is most likely is that which maximizes C{0;d), a result known as the 
maximum likelihood principle. Here it has been obtained again as a special case of a more 
general framework, under clearly stated hypotheses, without need to introduce new ad hoc 
rules. Note also that the inference does not depend on multiplicative factors in the likelihood. 
This is one of the ways to state the likelihood principle, ideally desired by frequentists, but often 
violated. This 'principle' always and naturally holds in Bayesian statistics. It is important to 
remark that the use of unnecessary principles is dangerous, because there is a tendency to use 
them uncritically. For example, formulae resulting from maximum likelihood are often used also 
when non-uniform reasonable priors should be taken into account, or when the shape of C{6; d) 
is far from being multi-variate Gaussian. (This is a kind of ancillary default hypothesis that 
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comes together with this principle, and is the source of the often misused 'A(— ln£) = 1/2' rule 
to determine probability intervals.) 

The usual least squares formulae are easily derived if we take the well-known case of pairs 
{xi,yi} (the generic d stands for all data points) whose true values are related by a deterministic 
function /ij^. = yifJ-x^, ^) and with Gaussian errors only in the ordinates, i.e. we consider Xi « fixi- 
In the case of independence of the measurements, the likelihood-dominated result becomes, 



p{9\x,y,I) 



oc 



J|exp 



2 a? 



(62) 



or 



p{9\x,y,I) oc exp 



where 



x\0) 



E 



(63) 



(64) 



is called 'chi-square,' well known among physicists. Maximizing the likelihood is equivalent to 
minimizing x^, and the most probable value of 6 is easily obtained (i.e. the mode indicated with 
Ofn), analytically in easy cases, or numerically for more complex ones. 

As far as the uncertainty in is concerned, the widely-used evaluation of the covariance 
matrix V(0) (see Sect. 15.6(1 from the Hessian, 



1 d^x' 



2 dOidOj 



6=0„ 



(65) 



is merely consequence of an assumed multi-variate Gaussian distribution of 6, that is a parabolic 
shape of (note that the 'A(— ln£) = 1/2' rule, and the from this rule resulting 'A^^ = 1 
rule,' has the same motivation). In fact, expanding x^(^) iii series around its minimum, we have 



(66) 



where stands for the the set of differences 9i — and H is the Hessian matrix, whose 
elements are given by twice the r.s.h. of Eq. (j65() . Equation (j63j) becomes then 



p{e\x,y,I) 



oc exp 



1 



H A^ 



(67) 



which we recognize to be a multi-variate Gaussian distribution if we identify H = V ^ . After 
normalization, we get finally 



p{e\x,y,I) « (2^)-"/2(detV)-i/2 exp 



2 



with n equal to the dimension of and det V indicating the determinant of V. Holding this 
approximation, E(0) is approximately equal to 9m- Note that the result (|68|) is exact when 
yilJ'Xi, 9) depends linearly on the various Oi. 

In routine applications, the hypotheses that lead to the maximum likelihood and least squares 
formulae often hold. But when these hypotheses are not justified, we need to characterize the 
result by the multi-dimensional posterior distribution p{9), going back to the more general 
expression Eq. (|^ . 



18 



The important conclusion from this section, as was the case for the definitions of probabihty 
in Sect. |31 is that Bayesian methods often lead to well-known conventional results, but without 
introducing them as new ad hoc rules as the need arises. The analyst acquires then a heightened 
sense of awareness about the range of validity of the methods. One might as well use these 
'recovered' methods within the Bayesian framework, with its more natural interpretation of the 
results. Then one can speak about the uncertainty in the model parameters and quantify it with 
probability values, which is the usual way in which physicists think. 



5.10 Gaussian approximation of the posterior distribution 

The substance of the results seen in the previous section holds also in the case in which the prior 
is not flat and, hence, cannot be absorbed in the normalization constant of the posterior. In 
fact, in many practical cases the posterior exhibits an approximately (multi-variate) Gaussian 
shape, even if the prior was not trivial. Having at hand an un-normalized posterior p(), i.e. 

p{e\d,i) = p{d\e,i)po{e,i), m 

we can take its minus-log function (p{6) = — lnp(9). If p(0 | x,y,I) has approximately a Gaus- 
sian shape, it follows that 

ip{6) « ^ Ae'^ V-'^ AO + constant . (70) 



V can be evaluated as 



where 6^ was obtained from the minimum of '■p{0). 



(71) 

6 = 6m 



6 Uncertainties from systematic effects 

The uncertainty described in the previous section are related to the so-called random, or statis- 
tical errors. Other important sources are, generally speaking (see ISO 1993 for details), related 
to uncertain values of influence variables on which the observed values, or the data-analysis pro- 
cess, might depend. In physics, we usually refer to these as systematic effects or errors. They 
can be related to the parameters of the experiment, like a particle beam energy or the exposure 
time, or to environmental variables, like temperature and pressure, calibration constants of the 
detector, and all other parameters, 'constants' (in the physical sense), and hypotheses that enter 
the data analysis. The important thing is that we are unsure about their precise value. Let us 
indicate all the influence variables with the vector h = {hi, /i2, . . . , hn}-, and their joint pdf as 
p{h\I). 

The treatment of uncertainties due to systematic errors has traditionally been lacking a 
consistent theory, essentially due the unsuitability to standard statistical methods of dealing 
with uncertainty in the most wide sense. Bayesian reasoning becomes crucial to handle these 
sources of uncertainty too, and even metrological organizations (ISO 1993) had to recognized it. 
For example, the ISO type B uncertainty is recommended to be "evaluated by scientific judgment 
based on all the available information on the possible variability" (ISO 1993) of the influence 
quantities (see also D'Agostini 1999c). 
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6.1 Reweighting of conditional inferences 



The values of the influence variables and their uncertainties contribute to our background knowl- 
edge I about the experimental measurements. Using Iq to represent our very general background 
knowledge, the posterior pdf will then be p{^ \ d,h,lQ), where the dependence on all possible 
values of h has been made explicit. The inference that takes into account the uncertain vector 
h is obtained using the rules of probability (see Tab. ^ by integrating the joint probability over 
the uninteresting influence variables: 



(72) 
(73) 



p{n\d,Io) = J p{^i,h\d,Io)dh 

= Jp{fi\d,h,Io)p{h\Io)dh. 



As a simple, but important case, let us consider a single influence variable given by an additive 
instrumental offset z, which is expected to be zero because the instrument has been calibrated 
as well as feasible and the remaining uncertainty is cr^. Modelling our uncertainty in 2; as a 
Gaussian distribution with a standard deviation cr^, the posterior for /x is 



p{n\d,Io) 



+00 



p{fj, I d, z, a, Jo) p{z \az,Io) dz 



(74) 
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2{a^ + al) 



(75) 
(76) 



The result is that the net variance is the sum of the variance in the measurement and the 
variance in the influence variable. 



6.2 Joint inference and marginalization of nuisance parameters 

A different approach, which produces identical results, is to think of a joint inference about both 
the quantities of interest and the influence variables: 

p{fi,h\d,Io) oc p{d\fi,h,Io)po{n,h\Io). (77) 

Then, marginalization is applied to the variables that we are not interested in (the so called 
nuisance parameters), obtaining 

p{fi\d,Io) = Jp{iJ,h\d,Io)dh (78) 
oc J p{d\fi,h,Io)po{fj.,h\Io)dh. (79) 

Equation H77|) shows a peculiar feature of Bayesian inference, namely the possibility making an 
inference about a number of variables larger than the number of the observed data. Certainly, 
there is no magic in it, and the resulting variables will be highly correlated. Moreover, the 
prior cannot be improper in all variables. But, by using informative priors in which experts 
feel confident, this feature allows one to tackle complex problems with missing or corrupted 
parameters. In the end, making use of marginalization, one can concentrate on the quantities 
of real interest. 
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The formulation of the problem in terms of Eqs. (|77)) and (|79|) allows one to solve problems in 
which the influence variables might depend on the true value /i, because po{n, h \ Iq) can model 
dependences between fj, and h. In most applications, h does not depend on fi, and the prior 
factors into the product of po{fi \ Iq) and poih \ Iq). When this happens, we recover exactly the 
same results as obtained using the reweighting of conditional inferences approach described just 
above. 

6.3 Correlation in results caused by systematic errors 

We can easily extend Eqs. (f73|) . (|77j) . and (f79|) to a joint inference of several variables, which, 
as we have seen, are nothing but parameters 6 of suitable models. Using the alternative ways 
described in Sects. IHTTl and we have 

pie\d,h,Io) oc p{d\6,h,Io)po{e\Io) (80) 

pie\d,io) = Jp{e\d,h,io)p{h\io)dh (81) 

and 

p{9,h\d,Io) oc p{d\e,h,Io)po{6,h\Io) (82) 

pi9\d,Io) = Jp{e,h\d,Io)dh, (83) 

respectively. The two ways lead to an identical result, as it can be seen comparing Eqs. H81|) 
and (IHHl) . 

Take a simple case of a common offset error of an instrument used to measure various quan- 
tities ^ii , resulting in the measurements di . We model each measurement as fii plus an error that 
is Gaussian distributed with a mean of zero and a standard deviation cxj . The calculation of the 
posterior distribution can be performed analytically, with the following results (see D'Agostini 
1999c for details): 

• The uncertainty in each /Xj is described by a Gaussian centered at di, with standard 



deviation a{fii) = y + erf, consistent with Eq. ((7B|) . 

The joint posterior distribution p{fii,fj,2, ■ ■ ■) does not factorize into the product of pifJ-i), 
p{p2), etc., because correlations are automatically introduced by the formalism, consistent 
with the intuitive thinking of what a common systematic should do. Therefore, the joint 
distribution will be a multi-variate Gaussian that takes into account correlation terms. 

The correlation coefficient between any pair is given by 

2 2 



We see that p{jii,iJLj) has the behavior expected from a common offset error; it is non- 
negative; it varies from practically zero, indicating negligible correlation, when (cj^ <C cij), 
to unity (cj^ ^> cjj), when the offset error dominates. 



6.4 Approximate methods and standard propagation applied to systematic 
errors 

When we have many uncertain influence factors and/or the model of uncertainty is non-Gaussian, 
the analytic solution of Eq. ((ZSJ, or Eqs. (f77 jl - (|79j) can be complicated, or not existing at all. 
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Then numeric or approximate methods are needed. The most powerful numerical methods are 
based on Monte Carlo (MC) techniques (see Sect-IHlfor a short account). This issue goes beyond 
the aim of this report. In a recent comprehensive particle-physics paper by Ciuchini et al (2001), 
these ideas have been used to infer the fundamental parameters of the Standard Model of particle 
physics, using all available experimental information. 

For routine use, a practical approximate method can be developed by thinking of the value 
inferred for the expected value of /i as a raw value, indicated with /i^, that is, fiR = /i h = E{h) 
('raw' in the sense that it needs later to be 'corrected' for all possible value of 6, as it will be 
clear in a while). The value of fx, which depends on the possible values of h, can be seen as a 
function of /i/j and h: 



(85) 



We have thus turned our inferential problem into a standard problem of evaluation of the pdf of a 
function of variables, of which are particularly known the formulae to obtain approximate values 
for expectations and standard deviations in the case of independent input quantities (following 
the nomenclature of ISO 1993): 



E(/i) 



E(Mfl),E(^)y 



(86) 




(87) 



Extension to multi-dimensional problems and treatment of correlations is straightforward (the 
well-known covariance matrix propagation) and we refer to (D'Agostini and Raso 1999) for 
details. In particular, this reference contains approximate formulae valid up to second order, 
which allow to take into account relatively easily non linearities. 



7 Comparison of models of different complexity 

We have seen so far two typical inferential situations: 

1. Comparison of simple models (Sect.^, where by simple we mean that the models do not 
depend on parameters to be tuned to the experimental data. 

2. Parametric inference given a model, to which we have devoted the last sections. 

A more complex situation arises when we have several models, each of which might depend on 
several parameters. For simplicity, let us consider model A with ua parameters a and model B 
with ns parameters /3. In principle, the same Bayesian reasoning seen previously holds: 

P(A|Data,I) _ P(Data|A,I) P{A\I) 
P{B I Data, /) ~ P(Data|5,I) P{B\I) ' 

but we have to remember that the probability of the data, given a model, depends on the 
probability of the data, given a model and any particular set of parameters, weighted with the 
prior beliefs about parameters. We can use the same decomposition formula (see Tab.P), already 
applied in treating systematic errors (Sect.lHl): 

P(Data|M,/) = /p(Data | M, 0, /) p(0 | /) d0 , (89) 
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with M = A,B and = a,p. In particular, the Bayes factor appearing in Eq. (|88j) becomes 



P(Data|A,/) _ J P{Bata\ A, a, I) p{a\ I) da 

P(Data\B,I) ~ J P{Bata \ B , /S, I) p{P \ I) d/3 ^^^^ 
_ Data)po(Q)dQ 

/£ij(/3; Data) po(/3) d/3 • ^ ^ 

The inference depends on the marginalized likelihood (|89)) . also known as the evidence. Note that 
CM{d; Data) has its largest value around the maximum likelihood point 6ml, but the evidence 
takes into account all prior possibilities of the parameters. Thus, it is not enough that the best 
fit of one model is superior to its alternative, in the sense that, for instance, 

-Cyi(Q:AfL; Data) > £b(/3a/l; Data) , (92) 

and hence, assuming Gaussian models, 

XA("mmx2; Data) < XB(/3mmx2' ^^*^) ' (^^) 

to prefer model A. We have already seen that we need to take into account the prior beliefs in 
A and B. But even this is not enough: we also need to consider the space of possibilities and 
then the adaptation capability of each model. It is well understood that we do not choose an 
{n — 1) order polynomial as the best description - 'best' in inferential terms - of n experimental 
points, though such a model always offers an exact pointwise fit. Similarly, we are much more 
impressed by, and we tend a posteriori to believe more in, a theory that absolutely predicts an 
experimental observation, within a reasonable error, than another theory that performs similarly 
or even better after having adjusted many parameters. 

This intuitive reasoning is expressed formally in Eqs. H9U|) and (|91() . The evidence is given 
integrating the product C{6) and po{6) over the parameter space. So, the more po{0) is concen- 
trated around 0ml, the greater is the evidence in favor of that model. Instead, a model with a 
volume of the parameter space much larger than the one selected by C{9) gets disfavored. The 
extreme limit is that of a hypothetical model with so many parameters to describe whatever 
we shall observe. This effect is very welcome, and follows the Ockham's Razor scientific rule of 
discarding unnecessarily complicated models ( "entities should not be multiplied unnecessarily") . 
This rule comes out of the Bayesian approach automatically and it is discussed, with examples 
of applications in many papers. Berger and Jefferys (1992) introduce the connection between 
Ockham's Razor and Bayesian reasoning, and discuss the evidence provided by the motion of 
Mercury's perihelion in favor of Einstein's general relativity theory, compared to alternatives at 
that time. Examples of recent applications are Loredo and Lamb 2002 (analysis of neutrinos 
observed from supernova SN 1987A), John and Narlikar 2002 (comparisons of cosmological mod- 
els), Hobson et al 2002 (combination of cosmological datasets) and Astone et al 2003 (analysis 
of coincidence data from gravitational wave detectors). These papers also give a concise account 
of underlying Bayesian ideas. 

After having emphasized the merits of model comparison formalized in Eqs. H90|) and ()91() . 
it is important to mention a related problem. In parametric inference we have seen that we can 
make an easy use of improper priors (see Tab. seen as limits of proper priors, essentially 
because they simplify in the Bayes formula. For example, we considered pq{^\I) of Eq. 
to be a constant, but this constant goes to zero as the range of ji diverges. Therefore, it does 
simplify in Eq. (|26|) . but not, in general, in Eqs. (|90jl and I|9H1. unless models A and B depend 
on the same number of parameters defined in the same ranges. Therefore, the general case of 
model comparison is limited to proper priors, and needs to be thought through better than when 
making parametric inference. 
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8 Choice of priors — a closer look 



So far, we have considered mainly likelihood-dominated situations, in which the prior pdf can 
be included in the normalization constant. But one should be careful about the possibility 
of uncritically use uniform priors, as a 'prescription,' or as a rule, though the rule might be 
associated with the name of famous persons. For instance, having made N interviews to infer 
the proportion ^ of a population that supports a party, it is not reasonable to assume a uniform 
prior of 6 between and 1. Similarly, having to infer the rate r of a Poisson process (such 
that \ = rT, where T is the measuring time) related, for example, to proton decay, cosmic ray 
events or gravitational wave signals, we do not believe, strictly, that p{r) is uniform between 
zero and infinity. Besides natural physical cut-off's (for example, very large proton decay r 
would prevent Life, or even stars, to exist), p{r) = constant implies to believe more high orders 
of magnitudes of r (see Astone and D'Agostini 1999 for details). In many cases (for example the 
mentioned searches for rare phenomena) our uncertainty could mean indifference over several 
orders of magnitude in the rate r. This indifference can be parametrized roughly with a prior 
uniform Inr yielding p{r) oc 1/r (the same prior is obtainable using invariant arguments, as it 
will be shown in a while). 

As the reader might imagine, the choice of priors is a highly debated issue, also among 
Bayesians. We do not pretend to give definitive statements, but would just like to touch on 
some important issues concerning priors. 

8.1 Logical and practical role of priors 

Priors are pointed to by those critical of the Bayesian approach as the major weakness of the 
theory. Instead, Bayesians consider them a crucial and powerful key point of the method. 
Priors are logically crucial because they are necessary to make probability inversions via Bayes' 
theorem. This point remains valid even in the case in which they are vague and apparently 
disappear in the Bayes' formula. Priors are powerful because they allow to deal with realistic 
situations in which informative prior knowledge can be taken into account and properly balanced 
with the experimental information. 

Indeed, we think that one of the advantages of Bayesian analysis is that it explicitly admits 
the existence of prior information, which naturally leads to the expectation that the prior will 
be specified in any honest account of a Bayesian analysis. This crucial point is often obscured in 
other types of analyses, in large part because the analysts maintain their method is 'objective.' 
Therefore, it is not easy, in those analyses, to recognize what are the specific assumptions made 
by the analyst — in practice the analyst's priors — and the assumptions included in the method 
(the latter assumptions are often unknown to the average practitioner). 

8.2 Purely subjective assessment of prior probabilities 

In principle, the point is simple, at least in one-dimensional problem in which there is good 
perception of the possible range in which the uncertain variable of interest could lie: try your 
best to model your prior beliefs. In practice, this advice seems difficult to follow because, even if 
we have a rough idea of what the value of a quantity should be, the representation of the prior in 
mathematical terms seems very committal, because a pdf implicitly contains an infinite number 
of precise probabilistic statements. (Even the uniform distribution says that we believe exactly in 
the same way to all values. Who believes exactly that?) It is then important to understand that, 
when expressing priors, what matters is not the precise mathematical formula, but the gross 
value of the probability mass indicated by the formula, how probabilities are intuitively perceived 
and how priors influence posteriors. When we say, intuitively, we believe something with a 95% 
confidence, it means "we are almost sure," but the precise value (95%, instead of 92% or 98%) 



24 



is not very relevant. Similarly, when we say that the prior knowledge is modeled by a Gaussian 
distribution centered around /xq with standard deviation ctq [Eq. (^5]) ]. it means means that we 
are quite confident that fi is within ifio, very sure that it is within =b2(To and almost certain 
that it is within ib3o"o. Values even farther from fiQ are possible, though we do not consider 
them very likely. But all models should be taken with a grain of salt, remembering that they are 
often just mathematical conveniences. For example, a textbook-Gaussian prior includes infinite 
deviations from the expected value and even negative values for physical quantities positively 
defined, like a temperature or a length. All absurdities, if taken literally. On the other hand, we 
think that all experienced physicists have in mind priors with low probability long tails in order 
to accommodate strong deviation from what is expected with highest probability. (Remember 
that where the prior is zero, the posterior must also be zero.) 

Summing up this point, it is important to understand that a prior should tell where the 
probability mass is concentrated, without taking too seriously the details, especially the tails 
of the distribution (which should be, however, enough extended to accommodate 'surprises'). 
The nice feature of Bayes' theorem is the ability of trasform such vague, fuzzy priors into solid 
estimates, if a sufficient amount of good quality data are at hand. For this reason, the use 
of improper priors is not considered to be problematic. Indeed, improper priors can just be 
considered a convenient way of modelling relative beliefs. 

In the case we have doubts about the choice of the prior, we can consider a family of functions 
with some hyperparameters. If we worry about the effect of the chosen prior on the posterior, 
we can perform a sensitivity analysis, i.e. to repeat the analysis for different, reasonable choices 
of the prior and check the variation of the result. The final uncertainty could, then, take into 
account also the uncertainty on the prior. Finally, in extreme cases in which priors play a crucial 
role and could dramatically change the conclusions, one should refrain to give probabilistic result, 
providing, instead, only Bayes factors, or even just likelihoods. An example of a recent result 
about gravitational wave searches presented in this way, see Astone et al (2002). 

Having clarified meaning and role of priors, it is rather evident that the practical choice of a 
prior depends on what is appropriate for the application. For example, in the area of imaging, 
smoothness of a reconstructed image might be appropriate in some situations. Smoothness may 
be imposed by a variety of means, for example, by simply setting the logarithm of the prior 
equal to an integral of the square of the second derivative of the image (von der Linden et al 
1996b). A more sophisticated approach goes under the name of Markov random fields (MRF), 
which can even preserve sharp edges in the estimated images (Bouman and Sauer 1993, Saquib 
et al 1997). A similar kind of prior is often appropriate for defomable geometric models, which 
can be used to represent the boundaries between various regions, for example, organs in medical 
images (Cunningham et al 1998). 

A procedure that helps in choosing the prior, expecially important in the cases in which 
the parameters do not have a straightforwardly perceptible influence on data, is to build a 
prior predictive pdf and check if this pdf would produce data conform with our prior beliefs. 
The prior predictive distribution is the analogue of the {posterior) predictive distribution we 
met in Sect. 15.71 with p{6 \ d, I) replaced by p{6 \ I) (note that the example of Sect. 15.71 was 
one-dimensional, with di = df and 6i = //), i.e. p{d | /) = / p{d \ 6, 1)p{6 \ I) dO. 

Often, expecially in complicated data analyses, we are not sufficiently knowledgable about 
the details of the problem. Thus, informative priors have to be modelled that capture the 
judgement of experts. For example, Meyer and Booker (2001) show a formal process of prior 
elicitation which has the aim at reducing, as much as possible, the bias in the experts' estimates 
of their confidence limits. This approach allows one to combine the results from several experts. 
In short, we can suggest the use of the 'coherent bet' (Sect. HI) to force experts to access their 
values of probability, asking them to provide an interval in which they feel 'practically sure', 
intervals on which they could wager 1:1, and so on. 
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8.3 Conjugate priors 



Because of computational problems, modelling priors has been traditionally a compromise be- 
tween a realistic assessment of beliefs and choosing a mathematical function that simplifies the 
analytic calculations. A well-known strategy is to choose a prior with a suitable form so the 
posterior belongs to the same functional family as the prior. The choice of the family depends on 
the likelihood. A prior and posterior chosen in this way are said to be conjugate. For instance, 
given a Gaussian likelihood and choosing a Gaussian prior, the posterior is still Gaussian, as we 
have seen in Eqs. H25() . (^5]) and H29() . This is because expressions of the form 



K exp 

can always be written in the form 

K' exp 



2ct2 



2a| 



2a'2 



The 



with suitable values for x', a' and K' . The Gaussian distribution is auto-conjugate, 
mathematics is simplified but, unfortunately, only one shape is possible. 

An interesting case, both for flexibility and practical interest is offered by the binomial 
likelihood (see Sect. 15. 3() . Apart from the binomial coefficient, p{n \ 9, N) has the shape 9^{1 — 
0\N-n T^i^i^]^ ]-^g_g ii^Q same structure as the Beta distribution, well known to statisticians: 



Beta(6' | r, s) 
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-9'-\l 



P{r,s) 

where P{r,s) stands for the Beta function, defined as 
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which can be expressed in terms of Euler's Gamma function as P{r,s) = r(r)r(s)/r(r -|- s 
Expectation and variance of the Beta distribution are: 
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(96) 
(97) 



{r + s + l)(r + s)2 

If r > 1 and s > 1, then the mode is unique, and it is at 9^ = (r — l)/(^ -|- s — 2). Depending 
on the value of the parameters the Beta pdf can take a large variety of shapes. For example, for 
large values of r and s, the function is very similar to a Gaussian distribution, while a constant 
function is obtained for r = s = 1. Using the Beta pdf as prior function in inferential problems 
with a binomial likelihood, we have 
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re, and expectation and 



"(1-0)^ 

The posterior distribution is still a Beta with r' = r + n and s' = s + N - 
standard deviation can be calculated easily from Eqs. (|96|) and (|97j) . These formulae demonstrate 
how the posterior estimates become progressively independent of the prior information in the 
limit of large numbers; this happens when both m ^ r and n — m ^ s. In this limit, we get the 
same result as for a uniform prior (r = s = 1). 



Tablel^lists some of the more useful conjugate priors. For a more complete collection of conjugate 
priors, see e.g. (Bernardo and Smith 1994, Gelman et al 1995). 
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Table 2: Some useful conjugate priors, x and n stand for the observed value (continuous or 
discrete, respectively) and 9 is the generic symbol for the parameter to infer, corresponding to 
^ of a Gaussian, 6 of a binomial and X of a Poisson distribution. 



likelihood 


conjugate prior 


posterior 


p{x 9) 




p{9 x) 


Normal(0, a) 


Normal(/Uo, ctq) 


Normal(^i,c7i) [Eqs. (|3n))~(|32|l] 


Binomial(iV, 9) 


Beta(r, s) 


Beta(r + n, s + N — n) 


Poisson (^) 


Gamma(r, s) 


Gamma(r + n, s + 1) 


Multinomial(0i, . . 


. , 9k) Dirichlet(ai, . . . 


, afc) Dirichlet(ai + ni, . . . , a/c + n^) 



8.4 General principle based priors 

Many who advocate using the Bayesian approach still want to keep 'subjective' contributions to 
the inference to a minimum. Their aim is to derive prior functions based on 'objective' arguments 
or general principles. As the reader might guess, this subject is rather controversial, and the risk 
of trasforming arguments, which might well be reasonable and useful in many circumstances, 
into dogmatic rules is high (D'Agostini, 1999e). 



8.4.1 Transformation invariance 

An important class of priors arises from the requirement of transformation invariance. We shall 
consider two specific cases, translation invariance and scale invariance. 

Translation invariance 

Let us assume we are indifferent over a transformation of the kind 9' = 9 + b, where 9 is 
our variable of interest and b a constant. Then p{9) d9 is an infinitesimal mass element of 
probability for 9 to be in the interval d^. Translation invariance requires that this mass 
element remains unchanged when expressed in terms of 9', i.e. 

p{9) d9 = p{9') d9' (100) 
= p{9 + b)d9, (101) 

since d9 = d9'. It is easy to see that in order for Eq. (llOlj) to hold for any b, p{9) must be 
equal to a constant for all values of 9 from — oo to +oo. It is therefore an improper prior. 
As discussed above, this is just a convenient modelling. For practical purposes this prior 
should always be regarded as the limit for ^ oo of p{9) = 1/ A9, where A9 is a large 
finite range around the values of interest. 

Scale invariance 

In other cases, we could be indifferent about a scale transformation, that \s 9' = [5 9, where 
/3 is a constant. This invariance implies, since d9' = /3d^ in this case, 

p{9) d9 = p{p9)Pd9, (102) 

i.e. 

p{(39) = (103) 
The solution of this functional equation is 

p{9) a I, (104) 
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as can be easily proved using Eq. p()4|) as test solution in Eq. This is the famous 

Jeffreys ' prior, since it was first proposed by Jeffreys. Note that this prior also can be 
stated as pilogO) = constant, as can be easily verified. The requirement of scale invariance 
also produces an improper prior, in the range Q < 9 < oo. Again, the improper prior must 
be understood as the limit of a proper prior extending several orders of magnitude around 
the values of interest. [Note that we constrain 6 to be positive because, traditionally, 
variables which are believed to satisfy this invariance are associated with positively defined 
quantities. Indeed, Eq. H104|) has a symmetric solution for negative quantities.] 

According to the supporters of these invariance motivated priors (see e.g. Jaynes 1968, 1973, 
1998, Sivia 1997, and Frohner 2000, Dose 2002) variables associated to translation invariance 
are location parameters, as the parameter /i in a Gaussian model; variables associated to scale 
invariance are scale parameters, like cr in a Gaussian model or A in a Poisson model. For criticism 
about the (mis-)use of this kind of prior see (D'Agostini 1999d). 

8.4.2 Maximum- entropy priors 

Another principle-based approach to assigning priors is based on in the Maximum Entropy 
principle (Jaynes 1957a, also 1983, 1998, Tribus 1969, von der Linden 1995, Sivia 1997, and 
Frohner 2000). The basic idea is to choose the prior function that maximizes the Shannon- 
Jaynes information entropy, 

n 

S = -Y^p^lnpi, (105) 

i 

subject to whatever is assumed to be known about the distribution. The larger S is, the 
greater is our ignorance about the uncertain value of interest. The value 5 = is obtained 
for a distribution that concentrates all the probability into a single value. In the case of no 
constraint other than normalization, Pi = 1), S" is maximized by the uniform distribution. 
Pi = 1/n, which is easily proved using Lagrange multipliers. For example, if the variable is an 
integer between and 10, a uniform distribution p{xi) = 1/11 gives S = 2.40. Any binomial 
distribution with n = 10 gives a smaller value, with a maximum of = 1.88 for ^ = 1/2 and a 
limit of S = for ^ or or ^ — > 1, where 6 is now the parameter of the binomial that gives 
the probability of success at each trial. 

Two famous cases of maximum-entropy priors for continuous variables are when the only 
information about the distribution is either the expected value or the expected value and the 
variance. Indeed, these are special cases of general constraints on the moments of the distribution 
My. (see Tab.P). For r = and 1, Mr is equal to unity and to the expected value, respectively. 
First and second moment together provide the variance (see Tab. ^ and Sect. EH). Let us sum 
up what the assumed knowledge on the various moments provides [see e.g. (Sivia 1997, Dose 
2002)]. 

Knowledge about Mq 

Normalization alone provides a uniform distribution over the interval in which the variable 
is defined: 

pie I Mo = 1) = a<e<h. (106) 

— a 

This is the extension to continuous variables of the discrete case we saw above. 

Knowledge about Mq and Mi [i.e. about E{6)] 

Adding to the constraint Mq = 1 the knowledge about the expectation of the variable. 
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plus the requirement that all non- negative values are allowed, an exponential distribution 
is obtained: 

p{e\Mo = l,Mi=E{0)) = :^e-^/^(^) 0<e<oo. (107) 

E(6') 

Knowledge about Mq, Mi and M2 [i.e. about E{9) and cr(9)] 

Finally, the further constraint provided by the standard deviation (related to first and 
second moment by the equation o"^ = M2 — Mf) yields a prior with a Gaussian shape 
independently of the range of 9, i.e. 

exp n — TfW\ — 

pi9\Mo = lM0),-m = 1 a<9<b. 

Ja 6XP [ 2 0-2(61) J CLt/ 

(108) 

The standard Gaussian is recovered when 9 is allowed to be any real number. 

Note, however, that the counterpart of Eq. (|1()5|) for continuous variables is not trivial, since all 
Pi of Eq. ()1U5() tend to zero. Hence the analogous functional form / p{9) \np{9) d9 no longer has 
a sensible interpretation in terms of uncertainty, as remarked by Bernardo and Smith (1994). 
The Jaynes' solution is to introduce a 'reference' density m{9) to make entropy invariant under 
coordinate transformation via / p{9)ln\p{9)/m{9)]d9. (It is important to remark that the first 
and the third case discussed above are valid under the assumption of a unity reference density.) 
This solution is not universally accepted (see Bernardo and Smith 1994), even though it conforms 
to the requirements of dimensional analysis. Anyhow, besides formal aspects and the undeniable 
aid of Maximum Entropy methods in complicate problems such as image reconstruction (Buck 
and Macauly 1991), we find it very difficult, if not impossible at all, that a practitioner holds that 
status of knowledge which give rise to the two celebrated cases discussed above. We find more 
reasonable the approach described in Sect. 18.21 that goes the other way around: we have a rough 
idea of where the quantity of interest could be, then we try to model it and to summarize it in 
terms of expected value and standard deviation. In particular, we find untenable the position of 
those who state that Gaussian distribution can only be justified by Maximum Entropy principle. 



8.4.3 Reference priors 

We conclude this discussion on priors by mentioning 'reference analysis,' which is an area of 
active research among statisticians. The intention is, similarly to that for other priors motivated 
by basic principles, that of "characterizing a ^non-informative' or ''objective^ prior distribution, 
representing ^prior ignorance,' ^ vague prior knowledge,' and ^letting the data speak for themselves' 
" (Bernardo and Smith 1994). However, "the problem is more complex than the apparent 
intuitive immediacy of these words and phrases would suggest" (Bernardo and Smith 1994, 
p. 298): 

"Put bluntly: data cannot ever speak entirely for themselves: every prior specification lias 
some informative posterior or predictive implications; and 'vague' is itself mucfi too vague 
an idea to be useful. There is no 'objective' prior that represents ignorance. 

On the other hand, we recognize that there is often a pragmatically important need for 
a form of prior to posterior analysis capturing, in some well-defined sense, the notion of the 
prior having a minimal effect, relative to the data, on the final inference. Such a reference 
analysis might be required as an approximation to actual beliefs; more typically, it might be 
required as a limiting 'what if?' baseline in considering a range of prior to posterior analyses, 
or as a default option when there are insufEcient resources for detailed elicitation of actual 
prior knowledge. 
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. . . From the approach we adopt, it wiU be clear that the reference prior component of 
the analysis is simply a mathematical tool. It has considerable pragmatic importance in 
implementing a reference analysis, whose role and character will be precisely defined, but it 
is not a privileged, 'unique non-informative' or 'objective' prior." 

The curious reader may take a look at the (Bernardo and Smith 1994) and references cited 
therein, as well as at Bernardo (1997). 

9 Computational issues 

The application of Bayesian ideas leads to computational problems, mostly related to the cal- 
culation of integrals for normalizing the posterior pdf and for obtaining credibility regions, or 
simply the moments of the distribution (and, hence, expectations, variances and covariances) . 
The difficulties become challenging for problems involving many parameters. This is one of the 
reasons why Bayesian inference was abandoned at the beginning of the 20**^ century in favor 
of simplified - and simplistic - methods. Indeed, the Bayesian renaissance over the past few 
decades is largely due to the emergence of new numerical methods and the dramatic increases 
in computational power, along with clarifying work on the foundations of the theory. 

9.1 Overview of approximate computational strategies 

In previous sections we have already seen some 'tricks' for simplifying the calculations. The 
main topic of this section will be an introduction to Monte Carlo (MC). But, before doing that, 
we think it is important to summarize the various 'tricks' here. Much specialized literature is 
available on several aspects of computation in statistics. For an excellent review paper on the 
subject see (Smith 1991). 

Conjugate priors 

We discussed this topic in Sect. 18.31 giving a couple of typical simple examples and refer- 
ences for a more detailed list of famous conjugate distributions. We want to remark here 
that a conjugate prior is a special case of the class of priors that simplify the calculation 
of the posterior (the uniform prior is the simplest of this kind of prior). 

Gaussian approximation 

For reasons that are connected with the central limit theorem, when there is a large amount 
of consistent data the posterior tends to be Gaussian, practically independently of the exact 
shape of the prior. The (multi-variate) Gaussian approximation, which we encountered in 
Sect. 15.101 has an important role for applications, either as a reasonable approximation 
of the 'true' posterior, or as a starting point for searching for a more accurate description 
of it. We also saw that in the case of practically flat priors this method recovers the 
well-known minimum chi-square or maximum likelihood methods. 

Numerical integration 

In the case of low dimensional problems, standard numerical integration using either scien- 
tific library functions or the interactive tools of modern computer packages provide an easy 
solution to many problems (thanks also to the graphical capabilities of modern programs 
which allow the shape of the posterior to be inspected and the best calculation strategy 
decided upon). This is a vast and growing subject, into we cannot enter in any depth here, 
but we assume the reader is familiar with some of these programs or packages. 

Monte Carlo methods 

Monte Carlo methodology is a science in itself and it is way beyond our remit to provide 
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an exhaustive introduction to it here. Nevertheless, we would like to introduce briefly 
some 'modern' (though the seminal work is already half a century old) methods which are 
becoming extremely popular and are often associated with Bayesian analysis, the so called 
Markov Chain Monte Carlo (MCMC) methods. 

9.2 Monte Carlo methods 

9.2.1 Estimating expectations by sampling 

The easy part of the Bayesian approach is to write down the un-normalized distribution of 
the parameters (Sect. I5.10|) . given the prior and the likelihood. This is simply p{0\d,I) = 
p{d I 6, 1) p{6 I /). The difficult task is to normalize this function and to calculate all expectations 
in which we are interested, such as expected values, variances, covariances and other moments. 
We might also want to get marginal distributions, credibility intervals (or hypervolumes) and so 
on. As is well-known, if we were able to sample the posterior (even the un-normalized one), i.e. to 
generate points of the parameter space according to their probability, we would have solved the 
problem, at least approximately. For example, the one-dimensional histogram of parameter 9i 
would represent its marginal and would allow the calculation of E(0j) ~ {6i), a{6i) ~ (9f) — {Oi)"^ 
and of probability intervals ((•) in the previous formulae stand for arithmetic averages of the 
MC sample). 

Let us consider the probability function p{x) of the discrete variables x and a function 
f{x) of which we want to evaluate the expectation over the distribution p{x). Extending the 
one-dimensional formula of Tab. Q to n dimension we have 

E[/(a;)] = ^•••^/(xi,...,x„)p(xi,...,xn) (109) 
= ^f{xi)p{xi), (110) 

i 

where the summation in Eq. H109|) is over the components, while the summation in Eq. H110|) is 
over possible points in the n-dimensional space of the variables. The result is the same. 

If we are able to sample a large number of points N according to the probability function 
p{x), we expect each point to be generated rrii times. The average {f{x)), calculated from the 
sample as 

(/N) = ^E/(^*)' (111) 

t 

(in which the index is named t as a reminder that this is a sum over a 'time' sequence) can also 
be rewritten as 

{fix)) = (112) 

i 

just grouping together the outcomes giving the same Xi. For a very large N, the ratios rrii/N 
are expected to be 'very close' to p{xi) (Bernoulli's theorem), and thus {f{x)) becomes a good 
approximation of E[/(a;)]. In fact, this approximation can be good (within tolerable errors) even 
if not all rrii are large and, indeed, even if many of them are null. Moreover, the same procedure 
can be extended to the continuum, in which case 'all points' (oo") can never be sampled. 

For simple distributions there are well-known standard techniques for generating pseudo- 
random numbers starting from pseudo-random numbers distributed uniformly between and 
1 (computer libraries are available for sampling points according to the most common distri- 
butions). We shall not enter into these basic techniques, but will concentrate instead on the 
calculation of expectations in more complicated cases. 
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9.2.2 Rejection sampling 



Let us assume we are able to generate points according to some function g{x), sucli tliat, given 
a constant c, p{x) < cg{x). We generate x* according to g{x) and decide to accept it with 
probability p{x*)/cg{x*) (i.e. we extract another random number between and 1 and accept 
the point if this number is below that ratio). It is easy to show that this procedure reshapes g{x) 
to p{x) and that it does not depend on the absolute normalization of p{x) (any normalization 
constant can be absorbed in the multiplicative constant c). A trivial choice of g{x), especially 
for simple one-dimensional problems, is a uniform distribution (this variation is known as the 
hit or miss method), though clearly it can be very inefficient. 

9.2.3 Importance sampling 



In this method, too, we start from an 'easy' function g{x), which 'we hope' will approximate 
p{x) of interest, of which in fact we know only its un-normalized expression p{x). However, 
there is no requirement about how g{x) approximates p{x) (but the goodness of approximation 
will have an impact on the efficacy of the method), apart from the condition that g{xi) must be 
positive wherever p{xi) is positive . 

The function g{x) can be used in the calculation of E[/(a;)], if we notice that E[/(a;)] can 
be rewritten as follows: 

_ Jfix)pix)dx 
^[^^"^^1 - mx)dx ^^^^^ 
Jfjx) [p{x)/g{x)] g{x)dx 

J[p{x)/g{x)]g{x)dx ^ > 

Eg [fix)p{x)/gix)] 



Eg [p{x)/g{x)] 



(115) 



where the the symbol Eg is a reminder that the expectation is calculated over the distribution 
g{x). Finally, the strategy can be implemented in the Monte Carlo using Eq. (jllll) for the two 
expectations: 

T?U(^\] ~ ^tf(.xt)p{xt)/gixt) 

From the same sample it is also possible to evaluate the normalization constant, given by the 
denominator of Eq. ()113() . i.e. 

pWd-^Eflg. (117) 

The computation of this quantity is particularly important when we are dealing with model 
comparison and Z has the meaning of 'evidence' (Sect.!?)). 

It easily to see that the method works well if g{x) overlaps well with p{x). Thus, a proper 
choice of g{x) can be made by studying where the probability mass of p{x) is concentrated (for 
example finding the mode of the distribution in a numerical way). Often a Gaussian function 
is used for g{x), with parameters chosen to approximate p{x) in the proximity of the mode, as 
described in Sect. 15.1(11 In other cases, other functions can be used which have more pronounced 
tails, like i-Student or Cauchy distributions. Special techniques, into which we cannot enter here, 
allow n independent random numbers to be generated and, subsequently, by proper rotations. 
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turned into other numbers which have a correlation matrix equal to that of the multi-dimensional 
Gaussian which approximates p{x). 

Note, finally, that, contrary to the rejection sampling, importance sampling is not suitable 
for generate samples of 'unweighted events', such as those routinely used in the planning and 
the analysis of many kind experiments, especially particle physics experiments. 

9.2.4 Metropolis algorithm 

A different class of Monte Carlo methods is based on Markov chains and is known as Markov 
Chain Monte Carlo. The basic difference from the methods described above is that the sequence 
of generated points takes a kind of random walk in parameter space, instead of each point being 
generated, one independently from another. Moreover, the probability of jumping from one 
point to an other depends only on the last point and not on the entire previous history (this 
is the peculiar property of a Markov chain). There are several MCMC algorithms. One of the 
most popular and simple algorithms, applicable to a wide class of problems, is the Metropolis 
algorithm (Metropolis et al 1953). One starts from an arbitrary point Xq and generates the 
sequence by repeating the following cycle, with Xt being the previously selected point at each 
iteration: 

1. Select a new trial point x*, chosen according to a symmetric proposal pdf q{9* \ Ot)- 

2. Calculate the acceptance probability 

A{x* I Xt) = min 

3. Accept X* with probability A{xt,x*), i.e. 

• if p{0*) > p{Ot), then accept x*; 

• ifp{9*) < p{Ot), extract a uniform random number between and 1 and accept x* if 
the random number is less then p{0*)/p{Ot)- 

If the point is accepted, then xt+i = x* . Otherwise Xt+i = xt 

This algorithm allows a jump Xt to Xt+i with probability T{xt+i \ Xt) (the transition kernel) 
equal to A{x* \ Xt) q{0* \ Ot)- The algorithm has a stationary asymptotic distribution equal to 
p{x) (i.e. the chain is ergodic) and ensures detailed balance: 

p{xt+i) T{xt I xt+i) = p{xt) T{xt+i I Xt) . (119) 

By construction, the algorithm docs not depend on the normalization constant, since what 
matters is the ratio of the pdf's. The variation of the algorithm in which the proposal pdf g() 
is not symmetric is due to Hasting (1970) and for this reason the algorithm is often also called 
Metropolis-Hasting. Moreover, what has been described here is the global Metropolis algorithm, 
in contrast to the local one, in which a cycle affects only one component of x. 

The fact that this algorithm belongs to the class of MCMC gives rise to two problems. First, 
each point in the chain has some correlation with the points which immediately preceded it, 
and usually the chain moves slowly (and irregularly) from one region in the variable space to 
another (note also that, if a proposed point is not accepted, the chain keep the same position 
in the next step, and this circumstance can happen several times consecutively). Second, the 
initial part of the sequence is strongly influenced by the arbitrary starting point. Therefore, it 
is necessary to remove the initial part of the chain. 



m) 



(118) 
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Using an MCMC for a complex problem is not an automatic procedure and some tuning is 
needed. One of the important things to choose with care is the proposal function. If too small 
jumps are proposed, the chain moves too slowly and, can even remain trapped in a subregion 
and never sample the rest of the parameter space, if the probability distribution is defined over 
disconnected regions. If too large steps are proposed, the proposed points could often fall in 
very low probability regions and not be accepted, in which case the chain remains stuck in a 
point for many cycles. 

For an interesting, insightful introduction to principles and applications of MCMC see (Kass 
et al 1998). A nice tutorial is given by (Hanson 2000). A recent application of Bayesian 
methods in cosmology, which uses MCMC and contains a pedagogical introduction too, can 
be found in (Lewis and Bridle 2002). For a good treatise, freely available on the web, (Neel 
1993) is recommended. The reader will find that MCMC techniques are used to solve complex 
problems graphically represented in terms of Bayesian networks (also known as belief networks 
or simply probabilistic network). This subject, which has revolutionized the way of thinking 
artificial intelligence and the uncertainty issues related to it, does beyond the purpose of this 
article. The interested reader can find more information in (Pearl 1988, BUGS 1996, Cowell et 
al 1999 and Cozman 2001) and references therein. 

10 Conclusions 

The gain in popularity Bayesian methods have enjoyed in recent years is due to various concep- 
tual and practical advantages they have over other approaches, among which are: 

• the recovery of the intuitive idea of probability as a valid concept for treating scientific 

problems; 

• the simplicity and naturalness of the basic tool; 

• the capability of combining prior knowledge and experimental information; 

• the property permitting automatic updating as soon as new information becomes available; 

• the transparency of the methods, which allow the different assumptions upon which an 
inference may depend to be checked and changed; 

• the high degree of awareness the methods give to the user. 

In this article we have seen how to build a theory of uncertainty in measurement as a straight- 
forward application of the basic Bayesian ideas, without unnecessary principles or ad hoc pre- 
scriptions. In particular, the uncertainty due to systematic errors can be treated in a consistent 
and powerful way. 

Providing an exact solution for inferential problems can easily lead to computational dif- 
ficulties. We have seen several ways to overcome such difficulties, either by using suitable 
approximations, or by using modern computational methods. In particular, it has been shown 
that the approximate solution often coincides with a 'conventional' method, but only under well 
defined conditions. Thus, for example, minimum formulae can be used, with a Bayesian spirit 
and with a natural interpretation of the results, in all those routine cases in which the analyst 
considers as reasonable the conditions of their validity. 

A variety of examples of applications have been shown, or mentioned, in this paper. 
Nevertheless, the aim of the author was not to provide a complete review of Bayesian methods 
and applications, but rather to introduce those Bayesian ideas that could be of help in 
understanding more specialized literature. Compendia of the Bayesian theory are given in 
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(Bernardo and Smith 1994, O'Hagan A 1994 and Robert 2001). Classic, influential books are 
(Jeffreys 1961, de Finetti 1974, Jaynes 1998). Among the many books introducing Bayesian 
methods, (Sivia 1996) is particularly suitable for physicists. Other recommended texts which 
treat general aspects of data analysis are (Box and Tiao 1973, Bretthorst 1988, Lee 1989, 
Gelman et al 1995, Cowell et al 1999, Denison et al 2002, Press 2002). More specific applications 
can be found in the proceedings of the conference series and several web sites. Some useful 
starting points for web navigation are given: 



ISBA book list 
UAI proceedings 
BIPS 
BLIP 

IPP Bayesian analysis group 
Valencia meetings 

Maximum Entropy resources 
MCMC preprint service 



http: / /www. bayesian.org/books/books.html 
http://www2.sis.pitt.edu/ dsl/UAI/uai.html 
http:/ /astrosun. tn.cornell.edu/staff/loredo/bayes/ 
http: / /www. ar-tistc.com/blip. html 
http: //www. ipp.mpg.de/OP/Datenanalyse/ 
http://www.uv.es/~bernardo/valenciam.html 
http: / /omega. albany.edu:8008/maxent. html 
http:/ /www. statslab.cam.ac.uk/~mcmc/ 



I am indebted to Volker Dose and Ken Hanson for extensive discussions concerning the contents 
of this paper, as well as for substantial editorial help. The manuscript has also benefited from 
comments by Tom Loredo. 
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